Master's in Data Science¶
Machine Learning Project¶
Álvaro Rull and Salvador Heras¶
Contacts: alvaro.rull@cunef.edu ; salvador.heras@cunef.edu
Block 1: loading and understanding the data¶
Import libraries¶
import warnings

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import scipy.stats as ss
import category_encoders as ce

import sklearn
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import (classification_report, confusion_matrix, roc_curve, auc,
                             silhouette_score, recall_score, precision_score, make_scorer,
                             roc_auc_score, f1_score, precision_recall_curve, accuracy_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from imblearn.over_sampling import RandomOverSampler

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5000)
Define the helper functions used in the following cells¶
def dame_variables_categoricas(dataset=None, max_valores_distintos=100):
    '''
    ----------------------------------------------------------------------------------------------------------
    Function dame_variables_categoricas:
    ----------------------------------------------------------------------------------------------------------
    - Description: receives a dataset and returns a list with the names of the
      categorical variables.
    - Inputs:
        -- dataset: pandas DataFrame containing the data
        -- max_valores_distintos: maximum number of distinct values allowed for a
           variable to be considered categorical (default 100)
    - Return:
        -- lista_variables_categoricas: list with the names of the categorical variables of the
           input dataset with fewer than max_valores_distintos distinct values
        -- 1: the execution failed
    '''
    if dataset is None:
        print(u'\nMissing arguments in the function call')
        return 1
    lista_variables_categoricas = []
    for columna in dataset.columns:
        if pd.api.types.is_object_dtype(dataset[columna]):
            # If the dtype is 'object' (categorical)
            if len(dataset[columna].dropna().unique()) < max_valores_distintos:
                lista_variables_categoricas.append(columna)
    return lista_variables_categoricas
def get_corr_matrix(dataset=None, metodo='pearson', size_figure=[10, 8]):
    # To obtain the Spearman correlation, just change the method to 'spearman'
    if dataset is None:
        print(u'\nMissing arguments in the function call')
        return 1
    sns.set(style="white")
    # Compute the correlation matrix
    corr = dataset.corr(method=metodo)
    # Set self-correlation to zero to avoid distraction
    for i in range(corr.shape[0]):
        corr.iloc[i, i] = 0
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=size_figure)
    # Draw the heatmap with the correct aspect ratio
    sns.heatmap(corr, center=0,
                square=True, linewidths=.5, cmap='viridis')  # cbar_kws={"shrink": .5}
    plt.show()
    return 0
def cramers_v(confusion_matrix):
    """
    Calculate Cramér's V statistic for categorical-categorical association,
    using the bias correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    confusion_matrix: numpy array, e.g. pd.crosstab(...).values
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
def duplicate_columns(frame):
    # Group columns by dtype so only comparable columns are checked against each other
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            ia = vs.iloc[:, i].values
            for j in range(i + 1, lcs):
                ja = vs.iloc[:, j].values
                # Two columns are duplicates if all their values are identical
                if np.array_equal(ia, ja):
                    dups.append(cs[i])
                    break
    return dups
DATASET¶
df_fraud = pd.read_csv('../data/Base.csv')
General overview of the table, missing/null values, and the target variable¶
print(df_fraud.shape, df_fraud.drop_duplicates().shape)
(1000000, 32) (1000000, 32)
df_fraud.dtypes.sort_values().to_frame('feature_type').groupby(by = 'feature_type').size().to_frame('count').reset_index()
|   | feature_type | count |
|---|---|---|
| 0 | int64 | 18 |
| 1 | float64 | 9 |
| 2 | object | 5 |
df_fraud.dtypes.to_dict()
{'fraud_bool': dtype('int64'),
'income': dtype('float64'),
'name_email_similarity': dtype('float64'),
'prev_address_months_count': dtype('int64'),
'current_address_months_count': dtype('int64'),
'customer_age': dtype('int64'),
'days_since_request': dtype('float64'),
'intended_balcon_amount': dtype('float64'),
'payment_type': dtype('O'),
'zip_count_4w': dtype('int64'),
'velocity_6h': dtype('float64'),
'velocity_24h': dtype('float64'),
'velocity_4w': dtype('float64'),
'bank_branch_count_8w': dtype('int64'),
'date_of_birth_distinct_emails_4w': dtype('int64'),
'employment_status': dtype('O'),
'credit_risk_score': dtype('int64'),
'email_is_free': dtype('int64'),
'housing_status': dtype('O'),
'phone_home_valid': dtype('int64'),
'phone_mobile_valid': dtype('int64'),
'bank_months_count': dtype('int64'),
'has_other_cards': dtype('int64'),
'proposed_credit_limit': dtype('float64'),
'foreign_request': dtype('int64'),
'source': dtype('O'),
'session_length_in_minutes': dtype('float64'),
'device_os': dtype('O'),
'keep_alive_session': dtype('int64'),
'device_distinct_emails_8w': dtype('int64'),
'device_fraud_count': dtype('int64'),
'month': dtype('int64')}
- Null, missing, or duplicate values
# Initialize an empty DataFrame to hold the percentage of missing values for each feature
missing_vals = pd.DataFrame()
# List of features whose negative values actually encode missing data
missing_features = ['prev_address_months_count', 'current_address_months_count', 'intended_balcon_amount',
                    'bank_months_count', 'session_length_in_minutes', 'device_distinct_emails_8w']
# For each feature, replace -1 values with NaN, calculate the percentage of missing values
# per class, and add it to the missing_vals DataFrame
for feature in missing_features:
    df_fraud.loc[df_fraud[feature] < 0, feature] = np.nan  # equivalent to df[feature].replace(-1, np.nan)
    missing_vals_col = df_fraud.groupby('fraud_bool')[feature].apply(lambda x: round(x.isna().sum() / len(x) * 100, 2))
    missing_vals[feature] = missing_vals_col
print(missing_vals)
            prev_address_months_count  current_address_months_count  \
fraud_bool
0                               71.06                          0.43
1                               91.89                          0.13

            intended_balcon_amount  bank_months_count  \
fraud_bool
0                             74.09              25.23
1                             88.38              37.54

            session_length_in_minutes  device_distinct_emails_8w
fraud_bool
0                                0.20                       0.04
1                                0.16                       0.04
- An empty DataFrame called missing_vals is created to store the percentage of missing values for each feature.
- A list of features called missing_features is specified to be checked for missing values.
- Each feature in the missing_features list is iterated over.
- The -1 values are replaced with NaN in the df_fraud DataFrame for each feature.
- The percentage of missing values is calculated for the current feature, grouped by the 'fraud_bool' column.
- The calculated percentages are appended to the missing_vals DataFrame.
- The missing_vals DataFrame is printed, showing the percentage of missing values for each feature, grouped by the 'fraud_bool' column.
# Reshape the missing_vals DataFrame from wide to long format
missing_vals = pd.DataFrame(missing_vals.T.stack())
# Reset the index and rename the columns
missing_vals.reset_index(inplace=True)
missing_vals.rename(columns={'level_0': 'feature', 0: 'missing_vals'}, inplace=True)
print(missing_vals)
                         feature  fraud_bool  missing_vals
0      prev_address_months_count           0         71.06
1      prev_address_months_count           1         91.89
2   current_address_months_count           0          0.43
3   current_address_months_count           1          0.13
4         intended_balcon_amount           0         74.09
5         intended_balcon_amount           1         88.38
6              bank_months_count           0         25.23
7              bank_months_count           1         37.54
8      session_length_in_minutes           0          0.20
9      session_length_in_minutes           1          0.16
10     device_distinct_emails_8w           0          0.04
11     device_distinct_emails_8w           1          0.04
We make sure the -1 values are NaN¶
df_fraud.head()
|   | fraud_bool | income | name_email_similarity | prev_address_months_count | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | payment_type | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | bank_branch_count_8w | date_of_birth_distinct_emails_4w | employment_status | credit_risk_score | email_is_free | housing_status | phone_home_valid | phone_mobile_valid | bank_months_count | has_other_cards | proposed_credit_limit | foreign_request | source | session_length_in_minutes | device_os | keep_alive_session | device_distinct_emails_8w | device_fraud_count | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.9 | 0.166828 | NaN | 88.0 | 50 | 0.020925 | NaN | AA | 769 | 10650.765523 | 3134.319630 | 3863.647740 | 1 | 6 | CA | 185 | 0 | BA | 1 | 0 | 24.0 | 0 | 500.0 | 0 | INTERNET | 3.888115 | windows | 0 | 1.0 | 0 | 7 |
| 1 | 1 | 0.9 | 0.296286 | NaN | 144.0 | 50 | 0.005418 | NaN | AB | 366 | 534.047319 | 2670.918292 | 3124.298166 | 718 | 3 | CA | 259 | 1 | BA | 0 | 0 | 15.0 | 0 | 1500.0 | 0 | INTERNET | 31.798819 | windows | 0 | 1.0 | 0 | 7 |
| 2 | 1 | 0.9 | 0.044985 | NaN | 132.0 | 40 | 3.108549 | NaN | AC | 870 | 4048.534263 | 2893.621498 | 3159.590679 | 1 | 14 | CB | 177 | 1 | BA | 0 | 1 | NaN | 0 | 200.0 | 0 | INTERNET | 4.728705 | other | 0 | 1.0 | 0 | 7 |
| 3 | 1 | 0.9 | 0.159511 | NaN | 22.0 | 50 | 0.019079 | NaN | AB | 810 | 3457.064063 | 4054.908412 | 3022.261812 | 1921 | 6 | CA | 110 | 1 | BA | 0 | 1 | 31.0 | 1 | 200.0 | 0 | INTERNET | 2.047904 | linux | 0 | 1.0 | 0 | 7 |
| 4 | 1 | 0.9 | 0.596414 | NaN | 218.0 | 50 | 0.004441 | NaN | AB | 890 | 5020.341679 | 2728.237159 | 3087.670952 | 1990 | 2 | CA | 295 | 1 | BA | 1 | 0 | 31.0 | 0 | 1500.0 | 0 | INTERNET | 3.775225 | macintosh | 1 | 1.0 | 0 | 7 |
duplicate_cols = duplicate_columns(df_fraud)
duplicate_cols
[]
df_fraud.shape
(1000000, 32)
- Target variable
fraud_vals = pd.DataFrame(df_fraud['fraud_bool'].value_counts())
print(fraud_vals)
             count
fraud_bool
0           988971
1            11029
df_fraud['fraud_bool'].dtype
dtype('int64')
df_plot_fraud_bool = df_fraud['fraud_bool']\
.value_counts(normalize=True)\
.mul(100).rename('percent').reset_index()
df_plot_fraud_bool_conteo = df_fraud['fraud_bool'].value_counts().reset_index()
df_plot_fraud_bool_pc = pd.merge(df_plot_fraud_bool, df_plot_fraud_bool_conteo, on=['fraud_bool'], how='inner')
df_plot_fraud_bool_pc
|   | fraud_bool | percent | count |
|---|---|---|---|
| 0 | 0 | 98.8971 | 988971 |
| 1 | 1 | 1.1029 | 11029 |
fig = px.histogram(df_plot_fraud_bool_pc, x='fraud_bool', y=['percent'])
fig.show()
Outlier Treatment¶
# Create a list of numeric features: columns of df_fraud with at least 10 distinct values
list_var_num = [x for x in df_fraud.columns if df_fraud[x].nunique() >= 10]
print(list_var_num)
['name_email_similarity', 'prev_address_months_count', 'current_address_months_count', 'days_since_request', 'intended_balcon_amount', 'zip_count_4w', 'velocity_6h', 'velocity_24h', 'velocity_4w', 'bank_branch_count_8w', 'date_of_birth_distinct_emails_4w', 'credit_risk_score', 'bank_months_count', 'proposed_credit_limit', 'session_length_in_minutes']
We store the numeric variables as list_var_num.
# Create a grid of subplots
fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(15, 15))
# Add a title to the figure
fig.suptitle('Distribution of numeric variables by fraud class')
# Loop through the numeric features and plot a kernel density plot for each feature
for i, feature in enumerate(list_var_num):
    ax = axes[i // 3][i % 3]
    sns.kdeplot(data=df_fraud[df_fraud['fraud_bool'] == 0][feature], fill=True, ax=ax, label='Not Fraud')
    sns.kdeplot(data=df_fraud[df_fraud['fraud_bool'] == 1][feature], fill=True, ax=ax, label='Fraud')
    ax.set_xlabel(feature)
    ax.legend()
# Adjust the layout and display the plot
plt.tight_layout()
plt.show()
# Create a grid of subplots
fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(15, 15))
# Add a title to the figure
fig.suptitle('Box plots of numeric variables by fraud class')
# Loop through the numeric features and plot a box plot for each feature
for i, feature in enumerate(list_var_num):
    ax = axes[i // 3][i % 3]
    sns.boxplot(data=df_fraud, x='fraud_bool', y=feature, ax=ax, palette='Set1', boxprops=dict(alpha=.6))
    ax.set_xlabel('')
    ax.set_ylabel(feature)
    ax.set_xticklabels(['Not Fraud', 'Fraud'])
# Adjust the layout and display the plot
plt.tight_layout()
plt.show()
Removing outliers in the context of imbalanced data can pose challenges. Outliers may contain valuable information or represent rare instances of the minority class. Their removal can lead to information loss, distort the class distribution, introduce bias towards the majority class, and deviate from real-world scenarios.
Considering the potential impact on model performance, it is important to exercise caution and explore alternative approaches that preserve the integrity of the imbalanced data while addressing outliers, such as robust modeling techniques or outlier detection methods specifically designed for imbalanced datasets.
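As one illustration of such an alternative, the sketch below caps extreme values with IQR-based winsorization instead of dropping rows, so minority-class observations are preserved. This is a minimal sketch, not the approach applied in this notebook; the function name winsorize_iqr and the threshold k=3.0 are illustrative assumptions.

# Minimal winsorization sketch (not applied to df_fraud in this notebook).
# Values outside [Q1 - k*IQR, Q3 + k*IQR] are clipped instead of removed,
# so no minority-class rows are lost. k=3.0 is an illustrative choice.
def winsorize_iqr(df, cols, k=3.0):
    df = df.copy()
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return df

# Hypothetical usage: df_capped = winsorize_iqr(df_fraud, list_var_num)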
Treatment of categorical and continuous variables¶
list_var_cat = dame_variables_categoricas(dataset=df_fraud)
df_fraud[list_var_cat] = df_fraud[list_var_cat].astype("category")
list_var_continuous = list(df_fraud.select_dtypes('float').columns)
df_fraud.dtypes
fraud_bool                             int64
income                               float64
name_email_similarity                float64
prev_address_months_count            float64
current_address_months_count         float64
customer_age                           int64
days_since_request                   float64
intended_balcon_amount               float64
payment_type                        category
zip_count_4w                           int64
velocity_6h                          float64
velocity_24h                         float64
velocity_4w                          float64
bank_branch_count_8w                   int64
date_of_birth_distinct_emails_4w       int64
employment_status                   category
credit_risk_score                      int64
email_is_free                          int64
housing_status                      category
phone_home_valid                       int64
phone_mobile_valid                     int64
bank_months_count                    float64
has_other_cards                        int64
proposed_credit_limit                float64
foreign_request                        int64
source                              category
session_length_in_minutes            float64
device_os                           category
keep_alive_session                     int64
device_distinct_emails_8w            float64
device_fraud_count                     int64
month                                  int64
dtype: object
list_var_cat
['payment_type', 'employment_status', 'housing_status', 'source', 'device_os']
list_var_continuous
['income', 'name_email_similarity', 'prev_address_months_count', 'current_address_months_count', 'days_since_request', 'intended_balcon_amount', 'velocity_6h', 'velocity_24h', 'velocity_4w', 'bank_months_count', 'proposed_credit_limit', 'session_length_in_minutes', 'device_distinct_emails_8w']
We store the categorical variables as list_var_cat and the continuous (float) variables as list_var_continuous.
# Use a local name that does not shadow sklearn.metrics.confusion_matrix
cross_tab = pd.crosstab(df_fraud["payment_type"], df_fraud["housing_status"])
print(cross_tab)
cramers_v(cross_tab.values)
housing_status     BA      BB      BC    BD     BE   BF   BG
payment_type
AA              38292   39686  113009  7324  59077  685  176
AB              67765  118946  120535  9092  53689  493   34
AC              43921   78343   91510  6784  31130  356   27
AD              19653   23932   46974  2955  25173  135   15
AE                 44      58     115     6     66    0    0
0.09833576596316075
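A value of about 0.098 points to a weak association between payment_type and housing_status. To scan every pair of categorical variables at once, a small loop over list_var_cat could be used; this is a sketch built on the cramers_v helper defined above, not a cell from the original notebook.

import itertools

# Sketch: bias-corrected Cramér's V for every pair of categorical variables.
for a, b in itertools.combinations(list_var_cat, 2):
    table = pd.crosstab(df_fraud[a], df_fraud[b]).values
    print(f"{a} vs {b}: {cramers_v(table):.3f}")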
Correlations¶
get_corr_matrix(dataset = df_fraud[list_var_num],
metodo='pearson', size_figure=[8,6])
0
One-Hot encoding¶
# Step 1: Get the list of categorical columns in the whole original dataset
list_columns_cat = list(df_fraud.select_dtypes(include=["object", "category"]).columns)
# Step 2: Apply one-hot encoding to the whole original dataset
ohe = ce.OneHotEncoder(cols=list_columns_cat)
df_fraud = ohe.fit_transform(df_fraud)
list_columns_cat
['payment_type', 'employment_status', 'housing_status', 'source', 'device_os']
df_fraud.head()
|   | fraud_bool | income | name_email_similarity | prev_address_months_count | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | payment_type_1 | payment_type_2 | payment_type_3 | payment_type_4 | payment_type_5 | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | bank_branch_count_8w | date_of_birth_distinct_emails_4w | employment_status_1 | employment_status_2 | employment_status_3 | employment_status_4 | employment_status_5 | employment_status_6 | employment_status_7 | credit_risk_score | email_is_free | housing_status_1 | housing_status_2 | housing_status_3 | housing_status_4 | housing_status_5 | housing_status_6 | housing_status_7 | phone_home_valid | phone_mobile_valid | bank_months_count | has_other_cards | proposed_credit_limit | foreign_request | source_1 | source_2 | session_length_in_minutes | device_os_1 | device_os_2 | device_os_3 | device_os_4 | device_os_5 | keep_alive_session | device_distinct_emails_8w | device_fraud_count | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.9 | 0.166828 | NaN | 88.0 | 50 | 0.020925 | NaN | 1 | 0 | 0 | 0 | 0 | 769 | 10650.765523 | 3134.319630 | 3863.647740 | 1 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 185 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 24.0 | 0 | 500.0 | 0 | 1 | 0 | 3.888115 | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 | 7 |
| 1 | 1 | 0.9 | 0.296286 | NaN | 144.0 | 50 | 0.005418 | NaN | 0 | 1 | 0 | 0 | 0 | 366 | 534.047319 | 2670.918292 | 3124.298166 | 718 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 259 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0 | 0 | 1500.0 | 0 | 1 | 0 | 31.798819 | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 | 7 |
| 2 | 1 | 0.9 | 0.044985 | NaN | 132.0 | 40 | 3.108549 | NaN | 0 | 0 | 1 | 0 | 0 | 870 | 4048.534263 | 2893.621498 | 3159.590679 | 1 | 14 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 177 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NaN | 0 | 200.0 | 0 | 1 | 0 | 4.728705 | 0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0 | 7 |
| 3 | 1 | 0.9 | 0.159511 | NaN | 22.0 | 50 | 0.019079 | NaN | 0 | 1 | 0 | 0 | 0 | 810 | 3457.064063 | 4054.908412 | 3022.261812 | 1921 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 110 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 31.0 | 1 | 200.0 | 0 | 1 | 0 | 2.047904 | 0 | 0 | 1 | 0 | 0 | 0 | 1.0 | 0 | 7 |
| 4 | 1 | 0.9 | 0.596414 | NaN | 218.0 | 50 | 0.004441 | NaN | 0 | 1 | 0 | 0 | 0 | 890 | 5020.341679 | 2728.237159 | 3087.670952 | 1990 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 295 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 31.0 | 0 | 1500.0 | 0 | 1 | 0 | 3.775225 | 0 | 0 | 0 | 1 | 0 | 1 | 1.0 | 0 | 7 |
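One caveat: the encoder was fitted on the full dataset before the train/test split. With one-hot encoding the practical risk is small, but the leakage-safe pattern is to fit the encoder on the training partition only and reuse it on the test partition. A minimal sketch, assuming X_train and X_test come from a split of the unencoded data:

# Sketch of a leakage-safe encoding workflow (assumes X_train / X_test were
# produced by splitting the *unencoded* data):
ohe = ce.OneHotEncoder(cols=list_columns_cat)
X_train_enc = ohe.fit_transform(X_train)  # learn the category levels from train only
X_test_enc = ohe.transform(X_test)        # apply the same mapping to test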
Feature scaling¶
# Separate the feature matrix and target variable
X = df_fraud.drop(['fraud_bool'], axis=1)
y = df_fraud['fraud_bool']
X.head()
|   | income | name_email_similarity | prev_address_months_count | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | payment_type_1 | payment_type_2 | payment_type_3 | payment_type_4 | payment_type_5 | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | bank_branch_count_8w | date_of_birth_distinct_emails_4w | employment_status_1 | employment_status_2 | employment_status_3 | employment_status_4 | employment_status_5 | employment_status_6 | employment_status_7 | credit_risk_score | email_is_free | housing_status_1 | housing_status_2 | housing_status_3 | housing_status_4 | housing_status_5 | housing_status_6 | housing_status_7 | phone_home_valid | phone_mobile_valid | bank_months_count | has_other_cards | proposed_credit_limit | foreign_request | source_1 | source_2 | session_length_in_minutes | device_os_1 | device_os_2 | device_os_3 | device_os_4 | device_os_5 | keep_alive_session | device_distinct_emails_8w | device_fraud_count | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.9 | 0.166828 | NaN | 88.0 | 50 | 0.020925 | NaN | 1 | 0 | 0 | 0 | 0 | 769 | 10650.765523 | 3134.319630 | 3863.647740 | 1 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 185 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 24.0 | 0 | 500.0 | 0 | 1 | 0 | 3.888115 | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 | 7 |
| 1 | 0.9 | 0.296286 | NaN | 144.0 | 50 | 0.005418 | NaN | 0 | 1 | 0 | 0 | 0 | 366 | 534.047319 | 2670.918292 | 3124.298166 | 718 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 259 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15.0 | 0 | 1500.0 | 0 | 1 | 0 | 31.798819 | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 | 7 |
| 2 | 0.9 | 0.044985 | NaN | 132.0 | 40 | 3.108549 | NaN | 0 | 0 | 1 | 0 | 0 | 870 | 4048.534263 | 2893.621498 | 3159.590679 | 1 | 14 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 177 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NaN | 0 | 200.0 | 0 | 1 | 0 | 4.728705 | 0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0 | 7 |
| 3 | 0.9 | 0.159511 | NaN | 22.0 | 50 | 0.019079 | NaN | 0 | 1 | 0 | 0 | 0 | 810 | 3457.064063 | 4054.908412 | 3022.261812 | 1921 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 110 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 31.0 | 1 | 200.0 | 0 | 1 | 0 | 2.047904 | 0 | 0 | 1 | 0 | 0 | 0 | 1.0 | 0 | 7 |
| 4 | 0.9 | 0.596414 | NaN | 218.0 | 50 | 0.004441 | NaN | 0 | 1 | 0 | 0 | 0 | 890 | 5020.341679 | 2728.237159 | 3087.670952 | 1990 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 295 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 31.0 | 0 | 1500.0 | 0 | 1 | 0 | 3.775225 | 0 | 0 | 0 | 1 | 0 | 1 | 1.0 | 0 | 7 |
y.head()
0    1
1    1
2    1
3    1
4    1
Name: fraud_bool, dtype: int64
We split into X and y to study the variables more easily and to perform the scaling.
# Scale the numeric features with StandardScaler (zero mean, unit variance)
X_modeled = X[list_var_num]
scaler = StandardScaler()
scaler.fit(X_modeled)
X_scaled = pd.DataFrame(scaler.transform(X_modeled), columns=X_modeled.columns, index=X_modeled.index)
# Replace the original numeric columns with their scaled versions
X.drop(columns=list_var_num, inplace=True)
X = pd.concat([X, X_scaled], axis=1)
X
|   | income | customer_age | payment_type_1 | payment_type_2 | payment_type_3 | payment_type_4 | payment_type_5 | employment_status_1 | employment_status_2 | employment_status_3 | employment_status_4 | employment_status_5 | employment_status_6 | employment_status_7 | email_is_free | housing_status_1 | housing_status_2 | housing_status_3 | housing_status_4 | housing_status_5 | housing_status_6 | housing_status_7 | phone_home_valid | phone_mobile_valid | has_other_cards | foreign_request | source_1 | source_2 | device_os_1 | device_os_2 | device_os_3 | device_os_4 | device_os_5 | keep_alive_session | device_distinct_emails_8w | device_fraud_count | month | name_email_similarity | prev_address_months_count | current_address_months_count | days_since_request | intended_balcon_amount | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | bank_branch_count_8w | date_of_birth_distinct_emails_4w | credit_risk_score | bank_months_count | proposed_credit_limit | session_length_in_minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.9 | 50 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 | 7 | -1.130538 | NaN | 0.011740 | -0.186699 | NaN | -0.799396 | 1.656644 | -1.105631 | -1.079180 | -0.398938 | -0.696005 | 0.775101 | 0.792636 | -0.032511 | -0.457429 |
| 1 | 0.9 | 50 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 | 7 | -0.682778 | NaN | 0.645158 | -0.189580 | NaN | -1.200242 | -1.705086 | -1.418907 | -1.882957 | 1.161029 | -1.291978 | 1.837071 | 0.011917 | 2.018520 | 3.017502 |
| 2 | 0.9 | 40 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0 | 7 | -1.551956 | NaN | 0.509426 | 0.387014 | NaN | -0.698936 | -0.537241 | -1.268351 | -1.844589 | -0.398938 | 0.893255 | 0.660293 | NaN | -0.647820 | -0.352774 |
| 3 | 0.9 | 50 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1.0 | 0 | 7 | -1.155844 | NaN | -0.734788 | -0.187041 | NaN | -0.758615 | -0.733783 | -0.483280 | -1.993885 | 3.778380 | -0.696005 | -0.301221 | 1.399862 | -0.647820 | -0.686538 |
| 4 | 0.9 | 50 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1.0 | 0 | 7 | 0.355278 | NaN | 1.482175 | -0.189762 | NaN | -0.679043 | -0.214315 | -1.380157 | -1.922776 | 3.928502 | -1.490635 | 2.353706 | 1.399862 | 2.018520 | -0.471484 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999995 | 0.6 | 40 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1.0 | 0 | 4 | -1.041291 | NaN | 0.192717 | -0.184902 | NaN | -0.764583 | 0.744478 | 2.414588 | 0.126448 | -0.398938 | -0.298690 | -0.803504 | 0.879383 | -0.647820 | 0.118191 |
| 999996 | 0.8 | 50 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 | 4 | -0.590419 | 1.372801 | -0.881832 | 0.111935 | NaN | 1.724043 | -0.090991 | 0.125329 | 0.180905 | -0.401114 | -1.490635 | 0.330221 | NaN | -0.647820 | 0.175009 |
| 999997 | 0.8 | 20 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0 | 4 | 1.334058 | NaN | -0.644300 | -0.187137 | -0.081325 | -0.050421 | 0.796778 | 0.609022 | -0.520880 | 4.000300 | -0.696005 | -0.961365 | -0.335069 | -0.647820 | 0.078852 |
| 999998 | 0.9 | 20 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1.0 | 0 | 4 | 0.928382 | NaN | 1.154155 | -0.187734 | 2.499417 | -0.153865 | 0.806593 | -0.532175 | -0.501738 | 3.249688 | -0.696005 | 0.459380 | 1.139622 | -0.032511 | -0.401659 |
| 999999 | 0.2 | 20 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1.0 | 0 | 4 | 0.704742 | NaN | 2.647211 | 0.302910 | -1.147918 | -0.618369 | 0.167587 | -0.726383 | -0.547908 | -0.396762 | 0.495940 | -1.363191 | 0.011917 | -0.647820 | -0.105225 |
1000000 rows × 52 columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 44)
y_train = pd.DataFrame(y_train, columns=['fraud_bool'])
y_test = pd.DataFrame(y_test, columns=['fraud_bool'])
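With a positive rate of about 1.1%, a stratified split keeps the fraud proportion identical in both partitions; and, as with the encoder, the scaler above was fitted on the full dataset, so a leakage-free variant would fit it on the training partition only. A sketch under those assumptions (the _s names are ours, so it does not clobber the split actually used above):

# Hypothetical leakage-free, stratified variant (not the split used above):
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=44, stratify=y
)
# Fit the scaler on the training partition only, then apply it to test.
# (Assumes X still held the unscaled numeric columns at this point.)
scaler_s = StandardScaler()
X_train_s[list_var_num] = scaler_s.fit_transform(X_train_s[list_var_num])
X_test_s[list_var_num] = scaler_s.transform(X_test_s[list_var_num])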
df_fraud.to_csv("./data_preprocesada/df_fraud_initial_preprocessing.csv")
X_train.to_csv("./data_preprocesada/X_train_preprocessing.csv")
X_test.to_csv("./data_preprocesada/X_test_preprocessing.csv")
y_train.to_csv("./data_preprocesada/y_train_preprocessing.csv")
y_test.to_csv("./data_preprocesada/y_test_preprocessing.csv")
The first steps are loading and understanding the data. This section covers loading the dataset from a CSV file and an initial inspection to understand its structure and basic content.
After this, we imported the helper functions needed for the data analysis, preparing the ground for a structured and efficient workflow.
Third, we carry out the general analysis of missing/null values and of the target variable. This section contains a descriptive analysis, including the identification and treatment of missing values, with particular attention to the target variable fraud_bool.
After that, we treat -1 values as NaN. This is a common practice for handling markers of missing or invalid data, and it improves data quality for subsequent analysis.
Later we address outliers. This section deals with identifying and handling atypical values, which is crucial to avoid distortions in the analysis and in predictive models.
As can be seen, this part contains a large number of plots for different variables, but we will comment on the two we found most interesting.
First, the plot of velocity_24h shows a clear difference in the distribution of this variable between fraudulent and non-fraudulent transactions. For non-fraudulent transactions, velocity_24h values are concentrated near zero, while fraudulent transactions show a wider distribution with peaks at higher values. This could indicate that activity over the last 24 hours (possibly transaction velocity or overall amount of activity) is a significant indicator of fraud.
On the other hand, the plot of days_since_request also shows differences in distribution between fraudulent and non-fraudulent transactions. Non-fraudulent transactions are more concentrated at low values, suggesting they tend to be more recent, while fraudulent transactions show a more spread-out distribution with fewer peaks, which could suggest that fraud is associated with requests made over a wider time range.
After the outlier treatment, we handle categorical and continuous variables, including techniques for dealing appropriately with each variable type, which is essential for data preparation in predictive modeling.
Later we compute correlations, exploring how the different variables relate to one another, which reveals important insights and helps with feature selection for modeling.
The plot found in this part is a correlation matrix showing how the numeric variables in the dataset relate to one another. The yellow areas indicate a moderate positive correlation, especially between velocity_6h and velocity_24h, suggesting that these variables tend to move together. Most variables show no strong correlations, which helps avoid multicollinearity in predictive models.
One of the last steps was applying One-Hot Encoding, a technique used to convert categorical variables into a format that Machine Learning models can use more effectively.
Finally, variable scaling, together with standardization and oversampling. Scaling is a crucial part of data preparation, especially for algorithms that are sensitive to feature scale. With standardization and oversampling we adjust the distribution of the variables and address the class imbalance, which is particularly relevant for datasets with imbalanced classes, as in fraud detection.
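Since the oversampling itself happens in a later block, here is a minimal sketch of how the imported RandomOverSampler would typically be applied: only to the training split, never to the test split, so evaluation still reflects the real class distribution. The _res names are illustrative assumptions.

# Sketch: oversample the minority (fraud) class on the training data only.
ros = RandomOverSampler(random_state=44)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train['fraud_bool'])
print(y_train_res.value_counts())  # classes now balanced 1:1 by default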